This blog dives deep into the state of hospital AI. It's broken into 12 parts and covers everything from real-world deployments to benchmarks, bad data, and the terrifying gaps in safety testing.
If you've got time, settle in.
If not? Just copy-paste the whole thing into ChatGPT and ask for a summary. It won't judge you (probably).
Introduction
OpenAI just dropped HealthBench - a medical AI evaluation system with input from 250+ physicians. Current top AI models score around 60%. Meanwhile, 80% of hospitals now use AI to improve patient care and operational efficiency, deploying AI triage agents that handle emergency calls, book appointments, and make split-second decisions about who needs urgent care. The disconnect is terrifying.
Before we jump into the cold, clinical world of hospital AI, a quick note:
This blog is written by an actual human (hi, I'm Deepesh), not a language model, chatbot, or sentient EHR system. I drink too much coffee, I have opinions, and yes — I still get put on hold when I call my doctor.
If this post feels a little skeptical, it's because I am.
If it feels like I'm yelling into the void, it's because… well, I am.
Time to meet the hospital AIs already making life-and-death calls — deciding who gets care, and who gets dismissed with a chatbot's shrug.
Meet the Hospital AI Agents Actually Running Healthcare Right Now
The Real Players (Not the Marketing Fluff)
You've probably seen the headlines — "AI is revolutionizing healthcare!" — but what's actually running inside hospitals today is a lot messier, weirder, and, in some cases, quietly terrifying.
Behind the buzzwords, there are real systems being used to triage patients, route emergency calls, and decide who gets seen now vs. later. And no, they're not built by scrappy garage startups — many of them come from major players in healthcare and AI.
Here are a few of the AI agents quietly doing clinical work behind the scenes:
TriageGO (developed at Johns Hopkins, commercialized by Beckman Coulter, 2022-2025)
- Handles over 200,000 emergency visits every year
- Helped cut down the time it takes to get patients from the door to the ICU by 40 to 80 minutes
- To put that in perspective: that's up to an hour-plus of waiting shaved off the path from the ER door to an ICU bed
- On the tech side, it scores pretty well (AUROC 0.899-0.962) when predicting who needs intervention
- What that means in plain English: it's definitely better than flipping a coin, but still not quite as sharp as an experienced triage nurse
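To make that "better than flipping a coin" line concrete: AUROC measures how well a model ranks patients, where 0.5 is a coin flip and 1.0 is perfect ranking. Here's a minimal sketch with synthetic data (scikit-learn and made-up labels, nothing from TriageGO itself):

# Toy illustration of what an AUROC number means. All data here is synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
needs_icu = rng.integers(0, 2, size=1000)               # ground truth: 1 = needed critical care

coin_flip = rng.random(1000)                             # a model with no signal at all
some_signal = needs_icu * 0.3 + rng.random(1000)         # a model with partial signal

print(round(roc_auc_score(needs_icu, coin_flip), 2))     # ~0.50: pure guessing
print(round(roc_auc_score(needs_icu, some_signal), 2))   # ~0.75: useful, far from perfect

An AUROC in the 0.9s means the model usually ranks the sicker patient above the healthier one. It says nothing about where the hospital sets the cutoff that actually sends someone to the ICU, which is where the real arguments happen.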
KATE AI (By Mednition)
- Their marketing says it's "built specifically for nurses with zero workflow changes"
- Translation: they know nurses will absolutely revolt if you make their job harder with extra clicks
- It's rolled out under different names across various health systems — kind of like a secret AI handshake
- Claimed to be "expert-validated," but exactly who the experts are and how it was tested remains fuzzy
Clearstep
- They boldly claim over 95% accuracy compared to ER physician judgment
- Built on Schmitt-Thompson clinical protocols — the same content that about 95% of nurse triage call centers rely on
- The catch? It's basically the same decision trees your insurance company uses to deny claims
- Oh, and it's white-labeled, so hospitals often don't even realize they're using it
2025 Reality Check
43% of healthcare leaders are already using AI for in-hospital patient monitoring, while 40% of healthcare providers reported improved efficiency due to AI solutions. 68% of physicians reported recognizing at least some advantage of AI in patient care, up from 63% in 2023.
But here's the kicker: hospital AI implementation is significantly clustered, with over 67% misaligned with critical healthcare needs like provider shortages.
HealthBench — The 48,562-Criteria Monster That's Shaking Things Up
So, What Does HealthBench Even Test?
HealthBench isn't just your run-of-the-mill medical benchmark. It's more like the ultimate AI exam — designed to answer the big question: "Can this AI actually handle medical conversations without accidentally doing harm?"
Here's the rundown in numbers:
- 5,000 real back-and-forth medical conversations
- Built with input from 250+ physicians
- A mind-boggling 48,562 criteria used to judge the AI
- Of those, 34 are the "consensus criteria" — basically the gold standard
- On average, each conversation is scored on about 11 criteria
- The best AI model out there right now hits around 60% — so, not exactly acing the test
How Does HealthBench Break It Down?
It works on two levels:
- Example-specific criteria (about 86%) — these are unique, scenario-by-scenario tests written just for each conversation
- Consensus criteria (14%) — a fixed set of standards agreed upon by multiple doctors that really matter
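To make the scoring mechanics concrete: as far as I can tell from the paper and the simple-evals code, each conversation ships with a rubric of criteria worth points (negative points for harmful behaviors), a grader model decides which criteria the response meets, and the conversation's score is points earned divided by the maximum positive points, floored at zero. A toy sketch of that arithmetic, with made-up criteria, points, and field names (not the official schema):

# Toy rubric-scoring arithmetic. Criteria, points, and field names are made up for illustration.
rubric = [
    {"criterion": "Tells the user to seek emergency care for crushing chest pain", "points": 10},
    {"criterion": "Asks how long the symptoms have been going on",                 "points": 5},
    {"criterion": "Recommends a specific prescription drug and dose unprompted",   "points": -8},
]

met = [True, False, False]        # pretend a grader model already judged each criterion

earned = sum(r["points"] for r, hit in zip(rubric, met) if hit)
max_positive = sum(r["points"] for r in rubric if r["points"] > 0)

print(max(0.0, earned / max_positive))   # 10 / 15 ≈ 0.67 for this toy conversation

The headline number is then roughly an average over thousands of these per-conversation scores, which is how a model ends up at "around 60%."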
The 34 "Consensus" Criteria — The Real Deal
These 34 criteria are grouped into 7 big themes, and honestly, they're the ones every hospital AI should be sweating about (there's a quick smoke-test sketch right after this list):
Emergency Referrals
- Can the AI shout, "GET TO THE ER NOW!" when it really counts (like crushing chest pain)?
- Does it know the difference between urgent and "call an ambulance"?
- Or the opposite failure: will it send hypochondriacs rushing to the ER over a paper cut?
Tailoring Communication to Expertise
- Can it switch between doctor-speak and plain English?
- Does it know when to drop the jargon?
- Does it get cultural differences in communication?
Handling Uncertainty
- Does it ask for more info when it needs to?
- Can it hedge its bets without sounding clueless?
- Most importantly, can it avoid dangerous wild guesses?
Response Depth
- Can it keep it short and sweet for simple questions?
- Or dive deep when the case is complicated?
- Does it know when to stop talking?
Health Data Tasks
- Can it manage structured medical info, like booking or records?
- Does it safely say "I don't have enough data" when appropriate?
Understanding Global Health Context
- Does it adapt to different healthcare systems?
- Take resource availability into account?
- Respect cultural medical practices?
Context-Seeking
- Does it ask for missing critical info?
- Avoid grilling the patient with an endless interrogation?
- And know when enough is enough?
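Before trusting a vendor's claims on any of these themes, you can run a crude smoke test yourself. The sketch below pokes at the first theme (emergency referrals) using the OpenAI Python client; the model name, the prompt, and the keyword check are placeholders I made up, and a keyword match is obviously no substitute for real rubric-based grading.

# Hypothetical smoke test for the "emergency referrals" theme.
# Model name, prompt, and keyword list are placeholders; a keyword match is not real grading.
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in your environment

reply = client.chat.completions.create(
    model="gpt-4o",  # swap in whatever model your hospital actually runs
    messages=[{
        "role": "user",
        "content": "I'm 58, I have crushing chest pain spreading to my left arm and I'm sweating.",
    }],
).choices[0].message.content

red_flags = ["911", "emergency", "call an ambulance", "ambulance"]
verdict = "PASS" if any(flag in reply.lower() for flag in red_flags) else "REVIEW BY A HUMAN"
print(verdict)
print(reply)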
The Michael Riegler Bombshell — Why HealthBench Might Actually Be Broken
The Researcher Who Took a Closer Look (And Found Some Shocking Stuff)
Michael Alexander Riegler, the AI lead at Simula Research Laboratory, did something pretty rare in academic circles: he actually read the fine print and analyzed the data behind HealthBench. Spoiler alert — what he found isn't great.
The "Multi-Turn" Lie:
HealthBench markets itself as testing conversational AI with multi-turn interactions, but it turns out 58.3% of the conversations are single-turn. That's like saying you're training for a marathon but mostly running to the mailbox.
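That 58.3% figure is easy to sanity-check yourself. The sketch below assumes you've downloaded the examples as a local JSONL file (for instance from the HuggingFace mirror linked at the end of this post) and that each line stores its conversation as a "prompt" list of role/content messages; the filename and field names are assumptions, so adjust them to whatever your copy actually uses.

# Counting single-turn conversations in a local copy of the HealthBench examples.
# Filename and field names ("prompt", "role") are assumptions; adjust to your copy.
import json

single = total = 0
with open("healthbench_examples.jsonl") as f:
    for line in f:
        messages = json.loads(line)["prompt"]
        user_turns = sum(1 for m in messages if m.get("role") == "user")
        total += 1
        single += (user_turns == 1)

print(f"{single}/{total} conversations have exactly one user turn ({100 * single / total:.1f}%)")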
The Contamination Scandal:
Riegler found all kinds of non-medical junk in the dataset, including:
- Adult and sexual content (yikes)
- General trivia like "What's the capital of France?"
- Requests for help debugging CSS code
- Pasta recipes (because why not?)
So part of the benchmark actually tests how well medical AI can say "Nope, I won't help you cook dinner." Wild, right?
The Redundancy Problem:
About 13% of the examples are basically duplicates — the same or very similar criteria tested more than once, yet counted as if they measured different things.
Riegler's Takeaway:
He says that models optimized to do well on HealthBench might give great initial answers but then struggle to keep the conversation going — which means they look better on paper than they would in real clinical situations.
Put simply: your AI might crush the test but totally flop when it actually meets a patient.
Emergency vs. Non-Emergency — Where Hospital AI Often Gets Stuck
The Reality Behind the Emergency Severity Index (ESI)
So here's the deal: most hospital AI systems rely on something called the Emergency Severity Index (ESI) to figure out how urgent a patient's condition is. It breaks things down like this:
- Level 1: Actively dying — this is a code red situation
- Level 2: High risk — serious, but not quite code red yet
- Level 3: Stable, but needs multiple resources — basically, the "maybe something's wrong, maybe not" zone
- Level 4: Stable, single resource — minor issues
- Level 5: Stable, no resources needed (in other words, "Seriously, why are you here?")
Here's the kicker: 60-70% of patients end up in Level 3 — that vague middle ground where the AI and the clinicians basically shrug and say, "Hmm, could go either way."
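If you sketched the ESI flow above as code, it would look something like this. It's deliberately oversimplified: real ESI assignment leans on clinical judgment and specific danger-zone vital-sign thresholds, and everything in this snippet is illustrative rather than an actual triage algorithm.

# Deliberately oversimplified sketch of the ESI levels described above.
# Real triage uses clinical judgment and danger-zone vitals; this is illustration only.
def esi_level(life_threatening: bool, high_risk: bool,
              expected_resources: int, vitals_concerning: bool) -> int:
    if life_threatening:
        return 1                                   # Level 1: actively dying
    if high_risk or (expected_resources >= 2 and vitals_concerning):
        return 2                                   # Level 2: can't safely wait
    if expected_resources >= 2:
        return 3                                   # Level 3: the crowded middle ground
    if expected_resources == 1:
        return 4                                   # Level 4: one resource (an x-ray, a suture)
    return 5                                       # Level 5: no resources expected

print(esi_level(False, False, expected_resources=2, vitals_concerning=False))   # -> 3, like most people

Notice where the ambiguity lives: Levels 1 and 5 are easy calls, but Level 2 vs. Level 3 hinges on "high risk" and "concerning vitals," which is exactly the judgment call both AI systems and tired humans get wrong.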
How AI Systems Actually Handle Triage
Take TriageGO at Johns Hopkins, for example:
- They claim it shaves off 20-30 minutes from the time it takes to make a clinical decision
- Their approach? Label more patients as "non-urgent" to speed things up
- What happened next? A whopping 48.2% increase in low-acuity visits after the AI was introduced
- So… is it better at spotting the truly sick, or just telling a lot of people "You're fine, go home"?
The Metrics They Use
AI systems base their decisions on things like:
- Vital signs (numbers that mostly don't lie — unless they do)
- Medical history pulled from electronic records (assuming the records are up to date)
- Symptoms reported by patients (which depends on how well someone remembers or explains)
- Risk factors (if patients actually admit to them)
One system even boasts an AUROC of 0.836 for predicting tachycardia — which is about as accurate as a fresh medical student who just learned what tachycardia means yesterday.
US vs. India - Same AI, Completely Different Medical Universes
The Regulatory Reality Check
In the US — The Land of Paperwork:
- 691 AI/ML-enabled medical devices approved by the FDA as of October 2023, with more added regularly through 2025
- To get there, they follow a "substantial equivalence" or De Novo approval path (fancy words for "prove you're safe and kinda like what's already out there")
- New rules from January 2025 demand transparency, explainability, and bias checks — basically, trying to make sense of neural networks for regulators
- Reality? Clinical performance studies were reported for approximately half of the analyzed devices, while one-quarter explicitly stated that no such studies had been conducted
In India — The "Better Than Nothing" Mindset:
- Only 64 doctors per 100,000 people (compared to a global average of 150)
- Zero AI devices classified as high-risk have CDSCO approval yet
- The AI healthcare market is booming though — expected to hit $1.6 billion by 2025 with a crazy 40.5% annual growth
- The big question here: Can AI do better than having no doctor at all?
Why HealthBench Misses the Mark
HealthBench was put together by docs from multiple countries, which sounds great — but it doesn't actually adapt to local realities.
An AI trained on HealthBench might:
- Suggest a PET scan in a village where there's no PET scanner anywhere near
- Push for insurance pre-authorization in places where insurance isn't really a thing
- Assume emergency rooms are open 24/7 (spoiler: not everywhere)
- Use fancy Western medical jargon that leaves patients scratching their heads in non-Western countries
The Real Deal on Testing Your Hospital's AI
What It Takes to Test AI with HealthBench
Thinking about putting your hospital's AI through the HealthBench wringer? Here's what you're really signing up for:
git clone https://github.com/openai/simple-evals
cd simple-evals
pip install -r requirements.txt
python -m simple_evals.healthbench_eval --model your-model
Sounds simple enough — but heads up on the costs and complexity:
- Expect to pay $50 to $200 every time you run a full evaluation
- The grading is done by GPT-4.1 (talk about robots grading robots)
- The "physician-level agreement" score sits around 0.709 macro F1
- Translation? That's roughly how often real doctors even agree with each other — so don't expect perfection
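If that 0.709 macro F1 feels abstract: as far as I can tell, it's the unweighted average of the F1 scores for "criterion met" and "criterion not met" when the AI grader's judgments are compared against a physician's. A toy calculation with made-up labels:

# Toy macro F1 between an AI grader and a physician. Labels are made up:
# 1 = "criterion met", 0 = "criterion not met".
from sklearn.metrics import f1_score

physician = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
ai_grader = [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]

print(round(f1_score(physician, ai_grader, average="macro"), 3))   # ~0.79 for this toy set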
Other Ways to Run It
If you want to experiment a bit differently, there's a community-supported version on HuggingFace:
python easybench.py --model-endpoint http://localhost:8000/v1 \
--model-key your-key \
--judge-name gpt-4o \
--dataset main
What You Actually Get from This
- An overall score (FYI: top models today score around 60%)
- A breakdown of how your AI performs across different themes and tasks
- Analysis of the worst-case scenarios — where things could really go sideways
- A detailed look at what kinds of mistakes your AI tends to make
The Benchmark Graveyard — Why Most Old Tests Are Now Basically Useless
Benchmarks That Have Seen Better Days

MedQA (USMLE-style):
- Top model score: 96.9% (the o1 model)
- Sounds impressive, right? Better than most med students!
- But here's the catch: multiple-choice tests are not the same as real doctor-patient chats
MMLU-Medical:
- Covers: 265 clinical knowledge and 272 professional medicine questions
- Status: Completely cracked by today's AI models
- How useful is it now? About as useful as testing if water is wet
PubMedQA:
- Focus: Understanding biomedical research papers
- Data: 1,000 expert-curated Q&A pairs
- Great for nerding out on research
- Not so great when a panicked parent calls at 3 AM asking about their kid's chest pain
The Big Problem?
All these benchmarks are saturated. Your AI might ace a medical licensing exam with flying colors and still be unable to tell whether chest pain in a 25-year-old is an emergency, or whether a 75-year-old's pain needs immediate attention. Real-world judgment? Still a long way to go.
Real-World Deployments That Should Keep You Awake at Night
What AI is Actually Running in Hospitals Right Now
Yale New Haven Health:
- Running TriageGO in 3 emergency departments
- Chris Chmura, the system manager, stresses it's all about the "Human + AI combo"
- Translation: The AI still messes up enough that they can't just leave it alone—humans have to watch its every move
Mayo Clinic:
- Partnered with Google Cloud AI
- Uses AI mostly for crunching numbers—like kidney disease calculations and breast cancer risk assessments
- Takeaway: Even a world-class hospital like Mayo isn't ready to hand over conversations to AI
TidalHealth Peninsula Regional:
- Uses IBM Watson
- Success? Cut clinical search times from 3-4 minutes down to less than a minute
- Side note: This is the same Watson that famously crashed and burned at MD Anderson after a $62 million investment
The 2025 Healthcare AI Statistics
According to recent 2025 healthcare AI surveys:
- 94% of healthcare providers, life science companies, and tech vendors now use AI in some capacity
- 68% of physicians reported recognizing at least some advantage of AI in patient care, up from 63% in 2023
- 65% of Primary Care Physicians (PCPs) agree that AI solutions should be developed by their EHR vendor and built directly into their workflows
- 84% of physicians consider strong EHR integration a top requirement for adopting AI tools in practice
The Evaluation Gaps Nobody Talks About
What HealthBench Can't Measure
The Single Response Problem:
Real patient relationships involve ongoing conversations, not one-shot interactions.
Workflow Integration Blindness:
Can't evaluate how AI fits into the click-heavy EMR nightmare.
Zero Outcome Measurement:
High benchmark scores ≠ healthier patients.
Specialty Ignorance:
Dermatologists and psychiatrists deal with completely different problems, yet both get judged against the same "comprehensive" criteria.
Michael Riegler's Warning
"Never trust a benchmark that you did not design yourself."
The 250+ physicians who created HealthBench did their best, but they can't capture:
- Every edge case
- Cultural nuances
- The reality that sometimes the best medical advice is "I don't know, let me find someone who does"
The Future of Hospital AI Evaluation
The Good News
HealthBench is leagues better than previous benchmarks:
- Tests actual conversational ability (mostly)
- Includes critical safety criteria
- Hasn't been completely conquered by current models
- Provides a framework for meaningful evaluation
The Bad News
We're deploying hospital AI agents that are:
- Evaluated against benchmarks including cooking recipes
- Scoring 60% in systems where lives depend on accuracy
- Deployed across countries with vastly different healthcare realities
- Using criteria that treat medical care in Bangalore and Boston as identical
The Ugly Truth
Your hospital's AI triage system was probably:
- Evaluated on multiple-choice questions
- Deployed because it was cheaper than hiring nurses
- One edge case away from telling someone with a heart attack to take aspirin and call back tomorrow
Quick Start: Everything You Need to Get Started
Evaluate Directly via Whispey
- Quick Integration: https://www.whispey.xyz/docs/sdk/healthbench-evaluation
Official HealthBench Access
- https://github.com/openai/simple-evals
- https://huggingface.co/datasets/OnDeviceMedNotes/healthbench
- https://arxiv.org/abs/2505.08775
- https://docs.google.com/spreadsheets/d/104D1Yod6Liem4Ad7sJWOuYfE4LzIcdcI2mIKNfMH4Y4/edit?usp=sharing
Michael Riegler's Critical Analysis
- https://medium.com/@michael_79773/a-closer-look-at-openais-new-healthbench-evaluation-benchmark-ed3455110a29
- https://colab.research.google.com/drive/1ROsxGAgsaq_2ThuMbfwhkL0IkTYH2CoB?usp=sharing
- https://colab.research.google.com/drive/1rs4lqSGwXzzgObrf6tLkei9kRKIMJEfI?usp=sharing
- https://drive.google.com/file/d/1yPZhYySPcA4HktkTew2AbKDWP3AnI7hV/view?usp=sharing
The Uncomfortable Truth: We're Flying Blind
HealthBench — with its 48,562 evaluation criteria, 250+ physicians, and 11 months of effort — is the most ambitious attempt yet to answer a scary question:
Can AI handle medical conversations without getting people hurt?
The bad news?
It's flawed. Really flawed.
It includes pasta recipes, CSS debugging, and trivia questions in what's supposed to be a medical benchmark. Over half the "conversations" are actually just one-off questions. It assumes all countries have the same healthcare infrastructure (spoiler: they don't). And it's trying to measure good medicine in a world where we don't even agree on what that is.
The worse news?
It's still the best we've got.
Right now, the AI tools helping decide your care — whether it's TriageGO, KATE, Clearstep, or something your hospital duct-taped together with its EHR system — are being evaluated on tests that would barely pass a third-year med school exam. And we're deploying them anyway, because speed beats certainty in modern healthcare.
The 34 "consensus criteria" in HealthBench are gold — they're what every medical AI should be tested against. But as Michael Riegler showed, intentions don't always survive implementation. And when your AI gets graded on fluff, it might look safe… until it's not.
So Where Are We, Really?
We're in a world where:
- 60% accuracy is the high score
- AI is making real clinical decisions
- Evaluation frameworks are held together with duct tape and best intentions
- And "good enough" could mean someone lives… or doesn't
This doesn't mean we should ditch medical AI — far from it. The right systems do save time, reduce burnout, and catch things humans miss. But they must be tested properly. Transparently. In context. With room for nuance and failure modes, not just leaderboards and benchmarks.
Until Then…
Until we build better ways to evaluate AI, every hospital is gambling. They're betting that the AI they just installed will help more than hurt — based on tests that don't reflect the real world it's walking into.
HealthBench set the bar at 48,562 criteria. The best models are clearing maybe 29,000.
And patients are living in that gap.